1 Inference: Motivating Example

Thus far, our analyses have been exploratory in nature. Consider an example. Scientists are developing a COVID test. Doctors gave both the new test and the “gold standard” to a group of 100 patients with COVID. The new test was accurate for 88 of these patients whereas the standard test was only accurate for 86.

  • Exploratory question
    What trends did we observe in our sample of data? Was the new test more accurate than the old in detecting COVID among the 100 patients?

  • Inferential question
    Do these data provide enough evidence to conclude that the new test is better at detecting COVID?

The point: we’ll start looking beyond just the trends in our sample data and exploring how to use sample data to make inferences about the larger population of interest. The first step in reaching this goal is understanding sampling distributions, a reflection of the potential error in our sample information.


2 Data Context

As part of the Justice40 Initiative, the Council on Environmental Quality (CEQ) recently launched the Climate and Economic Justice Screening Tool (CEJST). It pulls data from myriad sources, including the US Census, the Departments of Energy, Transportation, and Housing and Urban Development, the Centers for Disease Control and Prevention, and the Environmental Protection Agency’s Environmental Justice Screening Tool (EJScreen). By looking at data on communities (census tracts or smaller block group levels), the tool helps identify indicators of burdens in eight categories: climate change, energy, health, housing, legacy pollution, transportation, water and wastewater, and workforce development. Aligned with the same broader mission, County Health Rankings & Roadmaps is a program of the University of Wisconsin Population Health Institute that collates data on health equity at the county level. In this activity, we’ll examine county-level relationships between racial composition and food insecurity; the food insecurity data are incorporated from Feeding America’s Map the Meal Gap study.

Here are choropleths of our two variables of interest – percent_non_hispanic_white and percent_food_insecure – on 3142 US counties (data is missing for one county):


Next, let’s visualize the relationship between these two variables and build a linear regression model of percent_food_insecure with percent_non_hispanic_white as the lone explanatory variable:

mod1<-lm(data=county_health,formula=percent_food_insecure~1+percent_non_hispanic_white)
mod1$coefficients
##                (Intercept) percent_non_hispanic_white 
##                16.80546419                -0.05761333

So the model line is given by \[\text{percent_food_insecure} = 16.805 - 0.0576 * \text{percent_non_hispanic_white}.\]
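To see what this line says in practice, we can plug a value into the equation by hand or use predict() on the fitted mod1 object; the 80% figure below is just an illustrative input, not a number from the data:

```r
# Predicted food insecurity for a hypothetical county that is 80%
# non-Hispanic white (80 is an arbitrary illustrative value):
16.805 - 0.0576 * 80
## [1] 12.197

# Equivalently, with the fitted model object (this uses the
# full-precision coefficients, so it differs slightly from the
# hand computation):
predict(mod1, newdata = data.frame(percent_non_hispanic_white = 80))
```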


3 Sampling Distributions

FORGET THAT YOU KNOW ALL OF THE ABOVE.

Let’s pretend that we’re working within the typical scenario: we don’t have access to the entire population of interest. Instead, we need to estimate the population relationship using data from a randomly selected sample of counties.

3.1 Sampling and randomness in RStudio

We’ll be taking some random samples of counties throughout this activity. The underlying random number generator plays a role in the random sample we happen to get:

# Try the following chunk A FEW TIMES
sample_n(county_health, size = 2, replace = FALSE)
## # A tibble: 2 × 289
##   fips  state  county life_expectancy x95_percent_ci_low_5 x95_percent_ci_high_6
##   <chr> <chr>  <chr>            <dbl>                <dbl>                 <dbl>
## 1 22061 Louis… Linco…            76                   75.1                  76.9
## 2 29003 Misso… Andrew            79.1                 77.7                  80.4
## # ℹ 283 more variables: life_expectancy_aian <dbl>,
## #   life_expectancy_aian_95_percent_ci_low <dbl>,
## #   life_expectancy_aian_95_percent_ci_high <dbl>, life_expectancy_asian <dbl>,
## #   life_expectancy_asian_95_percent_ci_low <dbl>,
## #   life_expectancy_asian_95_percent_ci_high <dbl>,
## #   life_expectancy_black <dbl>, life_expectancy_black_95_percent_ci_low <dbl>,
## #   life_expectancy_black_95_percent_ci_high <dbl>, …
# Try the following FULL chunk A FEW TIMES
set.seed(38)
sample_n(county_health, size = 2, replace = FALSE)
## # A tibble: 2 × 289
##   fips  state  county life_expectancy x95_percent_ci_low_5 x95_percent_ci_high_6
##   <chr> <chr>  <chr>            <dbl>                <dbl>                 <dbl>
## 1 30069 Monta… Petro…            NA                   NA                    NA  
## 2 27029 Minne… Clear…            77.7                 75.7                  79.8
## # ℹ 283 more variables: life_expectancy_aian <dbl>,
## #   life_expectancy_aian_95_percent_ci_low <dbl>,
## #   life_expectancy_aian_95_percent_ci_high <dbl>, life_expectancy_asian <dbl>,
## #   life_expectancy_asian_95_percent_ci_low <dbl>,
## #   life_expectancy_asian_95_percent_ci_high <dbl>,
## #   life_expectancy_black <dbl>, life_expectancy_black_95_percent_ci_low <dbl>,
## #   life_expectancy_black_95_percent_ci_high <dbl>, …

NOTE: If we set.seed(some positive integer) before taking a random sample, we’ll get the same results. This reproducibility is important:

  • we get the same results every time we knit the Rmd
  • we can share our work with others and ensure they get our same answers
  • it would not be great if you submitted your work to, say, a journal, and weren’t able to back up / confirm / reproduce your results!
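A minimal sketch of this reproducibility, assuming county_health and dplyr are loaded as in the chunks above:

```r
# Re-setting the same seed reproduces the identical sample...
set.seed(38)
first_run  <- sample_n(county_health, size = 5, replace = FALSE)
set.seed(38)
second_run <- sample_n(county_health, size = 5, replace = FALSE)
identical(first_run, second_run)   # TRUE

# ...while sampling again WITHOUT re-setting the seed continues the
# random number stream and (almost surely) yields different counties.
```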

3.2 Class experiment: Different samples, different estimates

Using a class experiment, we will now explore the degree to which the sample data we happen to collect can impact our estimates and conclusions.

Exercise 3.1 (Construct your sample) Let’s each take a sample and see what we get. Let your seed be your birthday month (1 or 2 digits) followed by your birthday day (2 digits). For example, January 5 is 105, September 10 is 910, October 5 is 1005, and December 10 is 1210. Set this in RStudio (and change the chunk option to eval=TRUE):

set.seed(1012)

Take a random sample of 10 counties:

# set.seed(1012)
my_sample <- sample_n(county_health, size = 10, replace = FALSE)
my_sample
## # A tibble: 10 × 289
##    fips  state county life_expectancy x95_percent_ci_low_5 x95_percent_ci_high_6
##    <chr> <chr> <chr>            <dbl>                <dbl>                 <dbl>
##  1 27069 Minn… Kitts…            77.4                 74.7                  80.2
##  2 05107 Arka… Phill…            69.5                 67.9                  71  
##  3 08011 Colo… Bent              74                   71                    76.9
##  4 05009 Arka… Boone             76.1                 75.2                  77  
##  5 17125 Illi… Mason             75.1                 73.5                  76.7
##  6 08045 Colo… Garfi…            80.6                 79.8                  81.3
##  7 55101 Wisc… Racine            77.5                 77.1                  77.9
##  8 13075 Geor… Cook              74.1                 72.8                  75.4
##  9 13283 Geor… Treut…            74.9                 72.7                  77  
## 10 48401 Texas Rusk              76.1                 75.3                  76.9
## # ℹ 283 more variables: life_expectancy_aian <dbl>,
## #   life_expectancy_aian_95_percent_ci_low <dbl>,
## #   life_expectancy_aian_95_percent_ci_high <dbl>, life_expectancy_asian <dbl>,
## #   life_expectancy_asian_95_percent_ci_low <dbl>,
## #   life_expectancy_asian_95_percent_ci_high <dbl>,
## #   life_expectancy_black <dbl>, life_expectancy_black_95_percent_ci_low <dbl>,
## #   life_expectancy_black_95_percent_ci_high <dbl>, …
ggplot(my_sample,aes(x=percent_non_hispanic_white, y=percent_food_insecure))+
  geom_point()+
  geom_smooth(method="lm",se=FALSE)

lm(percent_food_insecure ~ 1 + percent_non_hispanic_white, data=my_sample)
## 
## Call:
## lm(formula = percent_food_insecure ~ 1 + percent_non_hispanic_white, 
##     data = my_sample)
## 
## Coefficients:
##                (Intercept)  percent_non_hispanic_white  
##                    22.9959                     -0.1331

Use your sample to construct and plot a sample model. How close is your sample estimate to the actual population model (percent_food_insecure = 16.805 - 0.0576 * percent_non_hispanic_white)?

Exercise 4.1 (Report your results) Indicate your sample intercept and slope estimates in this survey.

Exercise 4.2 (Repeat experiment) Choose another random seed and generate a new random sample of 10 counties. Repeat this experiment, and fill out the Google form again.

Compare estimates

Now let’s import each student’s estimates from Google sheets:

results <- gsheet2tbl('https://docs.google.com/spreadsheets/d/1teGGCtilEH3AL4t-7CA0w3tvpIJMzg_bI0wF0xzKTOs/edit?usp=sharing')

We’ll first compare the intercepts to each other and to the true population intercept in red:

ggplot(results, aes(x = intercept)) + 
  geom_histogram(color = "white") + 
  geom_vline(xintercept = 16.805, color = "red")

And then compare the slopes to each other and to the true population slope in red:

ggplot(results, aes(x = slope)) + 
  geom_histogram(color = "white") + 
  geom_vline(xintercept = -.0576, color = "red")

Finally, here is a comparison of the resulting models to the true population model in red:

ggplot(county_health, aes(x = percent_non_hispanic_white, y = percent_food_insecure)) +
  geom_abline(data = results, aes(intercept = intercept, slope = slope), color = "gray") + 
  geom_smooth(method = "lm", color = "red", se = FALSE)

4.1 Simulation study

Our little experiment reflects very few of the more than \(_{3142}C_{10} > 2.5*10^{28}\) different samples of 10 counties that we could get from the entire population of 3142 counties!! In this section, you’ll run a simulation to study just how different these estimates could be.
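R can compute this count directly; choose() returns the binomial coefficient:

```r
# Number of distinct samples of 10 counties drawn from 3142:
choose(3142, 10)
# on the order of 2.5 x 10^28 -- far more samples than we could
# ever enumerate, which is why we simulate instead
```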

Whereas sample_n() takes a single sample of size \(n\) from a dataset, rep_sample_n() takes multiple samples of size \(n\). To get a feel for it, take 4 samples of size 2. The replicate variable in the output indicates the sample (1, 2, 3, 4) to which each sampled case corresponds.

example_1 <- rep_sample_n(county_health, size = 2, reps = 4, replace = FALSE)
dim(example_1)
## [1]   8 290
example_1
## # A tibble: 8 × 290
## # Groups:   replicate [4]
##   replicate fips  state       county     life_expectancy x95_percent_ci_low_5
##       <int> <chr> <chr>       <chr>                <dbl>                <dbl>
## 1         1 28057 Mississippi Itawamba              74.2                 73.1
## 2         1 06043 California  Mariposa              80.6                 79  
## 3         2 21047 Kentucky    Christian             73.8                 73.1
## 4         2 36083 New York    Rensselaer            78.8                 78.4
## 5         3 35017 New Mexico  Grant                 76.8                 75.6
## 6         3 29221 Missouri    Washington            73.1                 71.9
## 7         4 39057 Ohio        Greene                78.5                 78  
## 8         4 21043 Kentucky    Carter                73.5                 72.4
## # ℹ 284 more variables: x95_percent_ci_high_6 <dbl>,
## #   life_expectancy_aian <dbl>, life_expectancy_aian_95_percent_ci_low <dbl>,
## #   life_expectancy_aian_95_percent_ci_high <dbl>, life_expectancy_asian <dbl>,
## #   life_expectancy_asian_95_percent_ci_low <dbl>,
## #   life_expectancy_asian_95_percent_ci_high <dbl>,
## #   life_expectancy_black <dbl>, life_expectancy_black_95_percent_ci_low <dbl>,
## #   life_expectancy_black_95_percent_ci_high <dbl>, …

Exercise 4.3 (500 samples of size 10)

  1. To get a sense for the wide variety of samples we might get, take 500 samples of size \(n\) = 10. Store these as samples_10.
  2. Each sample produces a different estimate of the population model between percent_food_insecure and percent_non_hispanic_white. Use the following code to plot these 500 sample model estimates on the same frame. (Note the new group term!)
samples_10 <- rep_sample_n(county_health, size=10, reps=500, replace=FALSE)
ggplot(samples_10, aes(x = percent_non_hispanic_white, y = percent_food_insecure, group = replicate)) + 
    geom_smooth(method = "lm", se = FALSE, size = 0.5, color = "gray") 
  3. Let’s focus on the slopes of these 500 sample models. Use the following code to save the 500 percent_non_hispanic_white (slope) coefficients, stored under the estimate variable in the slopes_10 data frame. (Notice the new do() verb!)
slopes_10 <- samples_10 %>%    
    group_by(replicate) %>%     
    do(tidy(lm(percent_food_insecure ~ percent_non_hispanic_white, data = .))) %>% 
    filter(term == "percent_non_hispanic_white")
  4. Construct a histogram of the 500 sample estimates of the true slope. This histogram approximates a sampling distribution of the sample slopes. Set your x-axis limits to [-0.2,0.2]. Add a vertical red line at x=-0.0576, the slope of the population model we computed above.
ggplot(slopes_10, aes(x = estimate)) + 
  geom_histogram(color = "white") + 
  geom_vline(xintercept = -0.0576, color = "red") +
  xlim(-0.2, 0.2)
  5. Describe the sampling distribution: What’s its general shape? Where is it centered? Roughly what is its spread? i.e., what is the range of estimates you observed?

Solution. Roughly bell-shaped (normal), centered near the population slope of -0.0576, with sample estimates spread roughly between -0.2 and 0.2.

Exercise 4.4 (Increasing sample size) Suppose we increased our sample size from n = 10 to n = 50. What impact do you anticipate this having on the sampling distribution of sample slopes:

  • Around what value would you expect the distribution of sample slopes to be centered?
  • What general shape would you expect the distribution to have?
  • In comparison to estimates based on the samples of size 10, do you think the estimates based on samples of size 50 will be closer to or farther from the true slope (on average)? Why?

Solution.

  • It would still be centered around the actual population slope
  • A tighter (narrower) bell curve
  • Closer on average: larger samples produce estimates that vary less around the true slope, so unusual cases have less influence on any one estimate

Exercise 4.5 (Test your intuition)

  1. Repeat this entire simulation process with 500 samples of size n=50. Use set.seed(38) and the same naming conventions as above (e.g., samples_50 and slopes_50). Plot all 500 model lines on the same graph, extract the 500 slopes, and then generate a histogram of the resulting slope coefficients.
  2. Repeat part (a) with 500 samples of n=200.
set.seed(38)
samples_50 <- rep_sample_n(county_health, size=50, reps=500, replace=FALSE)

ggplot(samples_50, aes(x = percent_non_hispanic_white, y = percent_food_insecure, group = replicate)) + 
    geom_smooth(method = "lm", se = FALSE, size = 0.5, color = "gray") 

slopes_50 <- samples_50 %>%    
    group_by(replicate) %>%     
    do(tidy(lm(percent_food_insecure ~ percent_non_hispanic_white, data = .))) %>% 
    filter(term == "percent_non_hispanic_white")

ggplot(slopes_50, aes(x = estimate)) + 
  geom_histogram(color = "white") + 
  geom_vline(xintercept = -0.0576, color = "red") +
  xlim(-0.2, 0.2)

set.seed(38)
samples_200 <- rep_sample_n(county_health, size=200, reps=500, replace=FALSE)

ggplot(samples_200, aes(x = percent_non_hispanic_white, y = percent_food_insecure, group = replicate)) + 
    geom_smooth(method = "lm", se = FALSE, size = 0.5, color = "gray") 

slopes_200 <- samples_200 %>%    
    group_by(replicate) %>%     
    do(tidy(lm(percent_food_insecure ~ percent_non_hispanic_white, data = .))) %>% 
    filter(term == "percent_non_hispanic_white")

ggplot(slopes_200, aes(x = estimate)) + 
  geom_histogram(color = "white") + 
  geom_vline(xintercept = -0.0576, color = "red") +
  xlim(-0.2, 0.2)

Exercise 4.6 (Impact of the sample size: pictures) Compare the sampling distributions of the sample slopes for the estimates based on sizes 10, 50, and 200 by plotting them on the same frame, using the following code:

# Combine the estimates & sample size into a new data set
simulation_data <- data.frame(
  estimates = c(slopes_10$estimate, slopes_50$estimate, slopes_200$estimate), 
  # factor levels keep the sample sizes in increasing order in the legend
  sample_size = factor(rep(c("10", "50", "200"), each = 500),
                       levels = c("10", "50", "200")))

# Construct density plot
ggplot(simulation_data, aes(x = estimates, color = sample_size)) + 
  geom_density() + 
  labs(title = "SAMPLING Distributions")

Exercise 4.7 (Impact of the sample size: properties of the sampling distribution)

  1. From the simulation_data, calculate the mean and standard deviation in sample slopes calculated from samples of size 10, 50, and 200. HINT: group_by().
  2. Compare the means.
  3. Compare and interpret the standard deviations. NOTE: We call the standard deviation “standard error” here – it indicates the typical distance of a sample estimate from the actual population value.
simulation_data %>%
  group_by(sample_size) %>% 
  summarise(mean(estimates), sd(estimates))

Solution.

  2. The means are all very similar; the mean for sample size 10 is farthest from the population slope and the mean for sample size 200 is closest.
  3. Sample size 10 has the greatest standard error, and sample size 200 has the least.

Exercise 4.8 (Properties of sampling distributions) In light of these investigations, complete the following statements.

  1. For all sample sizes, the shape of the sampling distribution is normal.
  2. As sample size increases:
    The average sample slope estimate IS FAIRLY STABLE.
    The standard deviation of the sample slopes DECREASES
  3. Thus, as sample size increases, our sample slopes become MORE RELIABLE

4.2 Summary of sampling distributions

Consider a simple population model

\[y = \beta_0 + \beta_1 x.\]

In general, we don’t know \(\beta_0\) or \(\beta_1\). We are either working with (\(x,y\)) data on a sample of subjects from the broader population of interest or with a population that is in flux. Thus, our sample data give us an estimate of the population model:

\[y = \hat{\beta}_0 + \hat{\beta}_1 x\]

What we know about the sample model:

  • Our estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) will vary depending upon what data we happen to get
  • The sample provides an estimate of the true model
  • There is error in this estimate
  • We need a sense of the potential error in order to…
    • use our sample to make inferences about the broader population
    • ethically and rigorously communicate the potential limitations of our work

These concepts are captured in the sampling distribution of a sample estimate \(\hat{\beta}\) (e.g., \(\hat{\beta}_0\) or \(\hat{\beta}_1\)). Specifically, the sampling distribution of \(\hat{\beta}\) is a distribution of all possible \(\hat{\beta}\) we could observe based on all possible samples of the same size \(n\) from the population. It captures how \(\hat{\beta}\) can vary from sample to sample.

Interpretation warning

Note the differences between

  1. the population distribution
  2. the distribution of individual cases in a single sample (should be reflective of the population distribution for a large enough sample)
  3. the sampling distribution, which reflects how sample statistics (e.g., the mean, a regression coefficient) vary from sample to sample (not from individual case to individual case)

Impact of sample size on a sampling distribution
As the sample size n increases:

  • there is less variability (and more consistency) in the possible estimates from different samples
  • we are less likely to get estimates that are far from the truth

Below are sampling distributions of the sample slopes and sample models calculated from sample sizes of 10, 50, and 200 counties.

Standard error of \(\hat{\beta}\)

The standard deviation of the sampling distribution, which is called the standard error, measures the typical error in the sample slopes calculated from sample to sample. The greater the standard error, the less “reliable” the sample estimate. In linear algebraic notation, we can calculate standard errors as follows:

\[\begin{array}{ll} \text{population model}: & y = X \beta + \varepsilon \;\; \text{ where } \;\; \varepsilon \sim N(0, \sigma^2) \\ \text{sample estimate of $\beta$}: & \hat{\beta} = \left(X^TX\right)^{-1}X^Ty \\ \text{standard error of $\hat{\beta}$}: & s.e.\left(\hat{\beta}\right) = \sqrt{\sigma^2\left(X^TX\right)^{-1}} \propto \frac{\sigma}{\sqrt{n}} \\ \end{array},\]

where \(\sigma^2\) is the variance of the residuals. When the linear regression model assumptions are met, the standard error of \(\hat{\beta}\) is proportional to \(\frac{1}{\sqrt{n}}\), where \(n\) is the sample size. Thus the standard error decreases as the sample size \(n\) increases.
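To connect this formula to R’s output, here is a small sketch on simulated data (the numbers below are invented purely for illustration) that computes the coefficient standard errors from the design matrix and checks them against summary(lm):

```r
set.seed(1)
n <- 100
x <- runif(n, 0, 100)
y <- 16.8 - 0.058 * x + rnorm(n, sd = 3)   # invented toy data

fit <- lm(y ~ x)
X   <- model.matrix(fit)                   # n x 2 design matrix [1, x]
s2  <- sum(residuals(fit)^2) / (n - 2)     # estimate of sigma^2
se_manual <- sqrt(diag(s2 * solve(t(X) %*% X)))

# The manual standard errors match the "Std. Error" column of summary():
cbind(se_manual, se_lm = summary(fit)$coefficients[, "Std. Error"])
```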

4.3 Approximating Sampling Distributions

The simulation above was just pretend. In theory, the sampling distribution would give us a sense of the potential error in a sample estimate. But in practice we take only one sample from the population, not, say, 500 samples, and certainly not all possible samples, so we can never actually observe the sampling distribution.

An approximation of the potential error in our sample estimates is the best we can do! There are two main approaches to this approximation:

  • Central Limit Theorem
    Approximates the sampling distribution and standard error by assuming it follows a specific probability model.

  • Bootstrapping
    Does not make any probability model assumptions. Approximates the standard error and features of the sampling distribution by resampling the sample.

4.3.1 Central Limit Theorem

IF the regression model assumptions are met and our sample size \(n\) is “big enough”, then the Central Limit Theorem (CLT) guarantees that the sampling distribution of all possible \(\hat{\beta}_i\) we could observe from all possible samples of size \(n\) is approximately normally distributed around \(\beta_i\) with standard error \(s.e.(\hat{\beta}_i)\).

\[\hat{\beta}_i \stackrel{\cdot}{\sim} N\left(\beta_i, s.e.(\hat{\beta}_i)^2\right).\]

This statement assumes we knew the actual population value \(\beta_i\) and the actual standard error of the regression coefficient; in practice, we do not know either.

Connecting this to the 68-95-99.7 Rule, the CLT guarantees that (approximately)…

  • 68% of samples will produce \(\hat{\beta}_i\) within 1 st. err. of \(\beta_i\)

  • 95% of samples will produce \(\hat{\beta}_i\) within 2 st. err. of \(\beta_i\)

  • 99.7% of samples will produce \(\hat{\beta}_i\) within 3 st. err. of \(\beta_i\)
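These percentages can be verified directly from the normal distribution in R:

```r
# The 68-95-99.7 percentages come straight from the normal distribution:
pnorm(1) - pnorm(-1)   # ~0.683
pnorm(2) - pnorm(-2)   # ~0.954
pnorm(3) - pnorm(-3)   # ~0.997
```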


Aside: See this video for the CLT explained using bunnies.


Example 4.1 (Approximating the sampling distribution of a regression coefficient) Assume we have a single sample of size n=50 from the population of counties:

set.seed(1)    
one_sample_50 <- sample_n(county_health, size = 50, replace = FALSE)

# Estimate the model
one_sample_model <- lm(percent_food_insecure ~ 1+percent_non_hispanic_white, one_sample_50)
summary(one_sample_model)$coefficients
##                               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)                17.29607549 1.82195218  9.493156 1.351740e-12
## percent_non_hispanic_white -0.06271587 0.02492594 -2.516089 1.526461e-02
# Plot the model estimate
ggplot(one_sample_50, aes(x = percent_non_hispanic_white, y = percent_food_insecure)) + 
    geom_point() + 
    geom_smooth(method = "lm", se = FALSE)

In our crystal ball simulation study of 500 samples of size 50 from the population (Exercise 4.5), we calculated the standard error of \(\hat{\beta}_1\) to be around 0.0278. In reality, we can estimate this standard error using only our one_sample_50 data and a complicated formula \(c / \sqrt{n} = c / \sqrt{50}.\) The estimated Std. Error for \(\hat{\beta}_1\) shown in the one_sample_model summary table is 0.0249. Based on one_sample_50 alone, the CLT states that the distribution of \(\hat{\beta}_1\) is approximately normal: \[\hat{\beta}_1 \stackrel{\cdot}{\sim} N\left(\beta_1, 0.0249^2\right),\] where 0.0249 is the standard error estimate reported in the model summary table. In pictures:

Exercise 4.9 (CLT and the 68-95-99.7 rule) Use the CLT assumption to answer the following questions. For approximately what % of samples will \(\hat{\beta}_1\) be…

  1. too big: bigger than \(\beta_1\)
  2. lucky: within 2 standard errors (0.0499) of \(\beta_1\)
  3. unlucky: more than 2 standard errors (0.0499) from \(\beta_1\)
  4. really unlucky: more than 3 standard errors (0.0748) from \(\beta_1\)

Technical note: Typically, we need to estimate the standard error from our sample via some complicated formula \(c / \sqrt{n}\). It turns out that when we use this estimation, the distribution does not actually follow the normal distribution, but the Student’s t-distribution, which is similar to the normal with fatter tails.
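For example, with a sample of n = 50 and two coefficients, the relevant t-distribution has 48 degrees of freedom; its quantiles are slightly wider than the normal’s:

```r
qnorm(0.975)         # ~1.96 (normal)
qt(0.975, df = 48)   # ~2.01 (t, with fatter tails)
```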

4.3.2 Bootstrapping

We won’t discuss bootstrapping in detail, but here is the main idea.

Starting with our sample of size \(n\):

  1. Take many (e.g., 1000) REsamples of size \(n\) WITH REPLACEMENT from the sample. Some cases might appear multiple times, some not at all. This is analogous to seeing several similar cases in a sample from the population.

  2. Calculate \(\hat{\beta}\) using each sample. This gives 1000 estimates: \[\hat{\beta}^{(1)}, ..., \hat{\beta}^{(1000)}\] The distribution of these is our bootstrap distribution. The shape and spread of this distribution (though not the location) will mimic that of the sampling distribution.

  3. Use the bootstrap distribution to construct confidence intervals and hypothesis tests for \(\beta\) WITHOUT assuming any theory about the distribution, shape, spread of the sampling distribution.
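A minimal sketch of these three steps, assuming the one_sample_50 data frame from the CLT example above (this is an illustration, not part of the assigned activity):

```r
set.seed(38)
boot_slopes <- replicate(1000, {
  # Step 1: resample n = 50 counties WITH replacement from the sample
  resample <- sample_n(one_sample_50, size = nrow(one_sample_50),
                       replace = TRUE)
  # Step 2: refit the model and keep the slope estimate
  coef(lm(percent_food_insecure ~ percent_non_hispanic_white,
          data = resample))[2]
})

# Step 3: the spread of the bootstrap distribution approximates the
# standard error of the slope, with no normality assumption required.
sd(boot_slopes)
hist(boot_slopes)
```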

In pictures:

or from Fig. 5.4 of Statistical Modeling by Daniel Kaplan:

If interested, you can read more about bootstrapping in Modern Data Science with R or Statistical Modeling.


5 Confidence Intervals

Never trust a sample estimate / regression coefficient that’s reported without a measure of potential error! When interpreting and reporting sample estimates (means, model coefficients, model predictions, etc), it is crucial to incorporate the potential error in these estimates. This is commonly done using confidence intervals.

The confidence interval for some population quantity of interest \(b\) (e.g., a model coefficient or a mean)…
- reflects potential error in the sample estimate of \(b\)
- provides a range of plausible values for \(b\)
- can be constructed by \[\text{estimate} \pm \text{margin of error} = \text{estimate} \pm c \cdot \text{standard error of estimate}\]

Conf. level   c   Confidence interval                    Picture
68%           1   estimate \(\pm\) 1 standard error      –*–
95%           2   estimate \(\pm\) 2 standard errors     – –*– –
99.7%         3   estimate \(\pm\) 3 standard errors     – – –*– – –

In R, we can compute confidence intervals with confint(model_object, level = 0.99), where level can be set to any confidence level between 0 and 1.
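For instance, applied (purely to illustrate the mechanics) to the mod1 object fit in Section 2, confint() should roughly match the hand computation estimate \(\pm\) 2 standard errors:

```r
confint(mod1, level = 0.95)

# Hand version with c = 2 (approximate; confint uses the exact
# t-multiplier rather than 2, so the bounds differ slightly):
est <- coef(summary(mod1))["percent_non_hispanic_white", ]
est["Estimate"] + c(-2, 2) * est["Std. Error"]
```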


5.1 Example: CIs for percent_food_insecure vs. percent_non_hispanic_white (\(n=100\))

In each facet below, we perform 100 repetitions of sampling n=100 counties from county_health, use those samples to construct the model percent_food_insecure ~ 1 + percent_non_hispanic_white, approximate the standard error of the sampling distribution, and plot the corresponding confidence intervals:

We can redo the same plot, coloring each confidence interval by whether it was “lucky” to include the model coefficient (slope) for the model computed on the entire population of 3142 counties:

We see that a higher confidence level leads to wider CIs and more CIs that cover the population model coefficient.

5.2 Interpreting a (95%) confidence interval

  • The “rough” but accepted way:
    We are “95% confident” that the true population value is between the lower and upper bounds of this CI.

  • The longer, technically correct way:
    Using this CI method, approximately 95% of all possible samples will produce 95% CI’s that cover the true value. The other 5% are based on unlucky samples that produce unusually low or high estimates.

  • The INCORRECT way:
    We cannot say, “There’s a 95% chance that the true population value is in the 95% CI.” Technically, the population value is either in the interval or not, so the probability is simply 1 or 0.

  • The 95% describes a quality of the procedure rather than the specific interval we obtained

  • Confidence intervals are about the precision of an estimate, not about individual cases

5.3 Practice

Exercise 5.1 Let’s take a random sample of 200 counties and use it to build a model of percent_food_insecure ~ 1 + percent_non_hispanic_white + percent_rural

set.seed(1000)
samp200<-sample_n(county_health, size = 200, replace = FALSE)
mod2<-lm(data=samp200,formula=percent_food_insecure ~ 1 + percent_non_hispanic_white + percent_rural)
coef(summary(mod2))
##                               Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept)                16.62173039 0.873835230 19.021584 1.761778e-46
## percent_non_hispanic_white -0.08551150 0.011265002 -7.590900 1.241243e-12
## percent_rural               0.03531017 0.006897597  5.119199 7.270530e-07
  1. Interpret the coefficient of percent_non_hispanic_white (-0.0855). Don’t forget to control for percent_rural in your interpretation.

  2. In the model summary - summary(mod2) - R reports the standard error for each coefficient. In particular, you’ll find the standard error of the percent_non_hispanic_white coefficient is reported to be 0.011265 (make sure that you see where this number is coming from in the R output). Use this standard error to compute and interpret an approximate 95% CI for the true model coefficient (on the whole population).

  3. Use R’s built-in confint command to compute a 95% CI on this coefficient. How does it compare to the one you computed above? Are they the same? Are they close? (“Yes” or “No” for each question will suffice.)

  4. We’d like to know if this CI provides any evidence for a relationship between percent_non_hispanic_white and percent_food_insecure, when controlling for percent_rural. What would the percent_non_hispanic_white coefficient of mod2 be if there were truly NO significant relationship between percent_non_hispanic_white and percent_food_insecure, when controlling for percent_rural?

  5. If the percent_non_hispanic_white coefficient for the model on the entire population were actually equal to 0, how likely would we be to have seen a coefficient of -0.0855 in our sample of 200 counties? This value would be 7.591 standard errors below the hypothesized value of 0 (re-examine the t value in the mod2 summary table above to understand how it is computed). The probability of choosing a sample of 200 counties that leads to a sample model coefficient 7.591 or more standard errors from the population model coefficient is given in the final column of that summary table. [There isn’t anything to answer for this question; just make sure you are comfortable with that logic.]

Solution.

  1. Controlling for percent_rural, each additional percentage point of non-Hispanic white residents is associated with an estimated 0.0855 percentage-point decrease in a county’s food insecurity rate.
  2. -0.108 to -0.063 (-0.08551150 - 2 * 0.011265002 to -0.08551150 + 2 * 0.011265002)
  3. not the same but very close
  4. 0

Exercise 5.2 Let’s think about the confidence interval procedure more generally.

  1. Is a 99% confidence interval wider or narrower than a 95% confidence interval?
  2. What are the bounds for a 100% confidence interval?
  3. The higher the confidence level, the better! Do you agree? Why or why not?

Solution.

  1. wider
  2. -infinity to +infinity (a 100% CI must contain every possible value)
  3. No; you should choose a confidence level appropriate to your scenario. An extremely high level produces an interval so wide that it may be too imprecise to be useful.